24 research outputs found

Approaches to Automatic Text Structuring

Author: Erbs Nicolai
Publication venue
Publication date: 11/09/2015
Field of study

Structured text helps readers to better understand the content of documents. In classic newspaper texts or books, some structure already exists. With Web 2.0, the amount of textual data, especially user-generated data, has increased dramatically. As a result, there exists a large amount of textual data which lacks structure, thus making it more difficult to understand. In this thesis, we explore techniques for automatic text structuring to help readers fulfill their information needs. Useful techniques for automatic text structuring are keyphrase identification, table-of-contents generation, and link identification. We improve state-of-the-art results for approaches to text structuring on several benchmark datasets. In addition, we present new representative datasets for users’ everyday tasks. We evaluate the quality of text structuring approaches with regard to these scenarios and discover that the quality of approaches highly depends on the dataset on which they are applied. In the first chapter of this thesis, we establish the theoretical foundations regarding text structuring. We describe our findings from a user survey regarding web usage, from which we derive three typical scenarios of Internet users. We then proceed to the three main contributions of this thesis. We evaluate approaches to keyphrase identification both by extracting and assigning keyphrases for English and German datasets. We find that unsupervised keyphrase extraction yields stable results, but for datasets with predefined keyphrases, additional filtering of keyphrases and assignment approaches yield even higher results. We present a decompounding extension, which further improves results for datasets with shorter texts.
We construct hierarchical tables of contents for documents in three English datasets and discover that the results for hierarchy identification are sufficient for an automatic system, but for segment title generation, user interaction based on suggestions is required. We investigate approaches to link identification, including the subtasks of identifying the mention (anchor) of the link and linking the mention to an entity (target). Approaches that make use of the Wikipedia link structure perform best, as long as there is sufficient training data available. For identifying links to sense inventories other than Wikipedia, approaches that do not make use of the link structure outperform the approaches using existing links. We further analyze the effect of senses on computing similarities. In contrast to entity linking, where most entities can be discriminated by their name, we consider cases where multiple entities with the same name exist. We discover that similarity depends on the selected sense inventory. To foster future evaluation of natural language processing components for text structuring, we present two prototypes of text structuring systems, which integrate techniques for automatic text structuring in a wiki setting and in an e-learning setting with eBooks.
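
As an illustration of the unsupervised keyphrase extraction mentioned in the abstract, the following is a minimal sketch that ranks single-word candidates by tf-idf. It is not the thesis's implementation: the function names `tokenize` and `tfidf_keyphrases`, the `background` corpus, and the restriction to single words are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keyphrases(document, background, top_n=3):
    """Rank the tokens of `document` by tf-idf (hypothetical helper).

    `background` is an assumed list of other documents used only to
    estimate document frequencies.
    """
    doc_tokens = tokenize(document)
    tf = Counter(doc_tokens)
    # The document itself is part of the corpus, so df is never zero.
    corpus = [set(tokenize(d)) for d in background] + [set(doc_tokens)]
    n_docs = len(corpus)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        scores[term] = freq * math.log(n_docs / df)
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


keyphrases = tfidf_keyphrases(
    "wiki links help wiki users find wiki pages",
    ["users read pages", "links connect pages"],
)
print(keyphrases)
```

Terms that occur in every document receive a score of zero, which is why a frequent but uninformative word falls to the bottom of the ranking in this sketch.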

TUbiblio

tuprints

Approaches to Automatic Text Structuring

Author: Erbs Nicolai
Publication venue
Publication date: 11/09/2015
Field of study

TUbiblio

Hierarchy Identification for Automatically Generating Table-of-Contents

Author: Erbs Nicolai
Gurevych Iryna
Zesch Torsten
Publication venue: INCOMA Ltd.
Publication date: 01/09/2013
Field of study

A table-of-contents (TOC) provides a quick reference to a document’s content and structure. We present the first study on identifying the hierarchical structure for automatically generating a TOC using only textual features instead of structural hints, e.g., from HTML tags. We create two new datasets to evaluate our approaches for hierarchy identification. We find that our algorithm performs on a level that is sufficient for a fully automated system. For documents without given segment titles, we extend our work by automatically generating segment titles. We make the datasets and our experimental framework publicly available in order to foster future research in TOC generation.

TUbiblio

Sense Similarity

Author: Erbs Nicolai
Gurevych Iryna
Zesch Torsten
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2014
Field of study

Sense and Similarity: A Study of Sense-level Similarity Measures

Crossref

TUdatalib Repository (TU Darmstadt)

Link Discovery: A Comprehensive Analysis

Author: Erbs Nicolai
Gurevych Iryna
Zesch Torsten
Publication venue
Publication date: 01/01/2011
Field of study

We present a comprehensive analysis of link discovery approaches. We classify them with regard to the type of knowledge being used, and identify three commonly used sources of knowledge: the text of a document, the document title, and already existing links. We analyze the influence of the knowledge source as well as of the amount of training data used. Results show that the link-based approach performs best if the amount of training data is very large. In a more realistic setting with less training data, the text-based approach yields better results.
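
To make the "document title" knowledge source more concrete, here is a minimal sketch of what a title-based approach could look like, assuming that candidate anchors are found by matching word n-grams against a list of known document titles. The function `find_title_links`, its `max_len` parameter, and the example titles are hypothetical and are not taken from the paper.

```python
def find_title_links(text, titles, max_len=3):
    """Return (anchor, target) pairs for word n-grams matching a known title.

    Hypothetical helper: matching is case-insensitive and prefers the
    longest n-gram (up to `max_len` words) at each position.
    """
    lookup = {t.lower(): t for t in titles}
    words = text.split()
    links = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n]).strip(".,;:!?")
            if candidate.lower() in lookup:
                match = (candidate, lookup[candidate.lower()], n)
                break
        if match:
            links.append((match[0], match[1]))
            i += match[2]  # Skip past the matched anchor.
        else:
            i += 1
    return links


links = find_title_links(
    "Wikulu supports keyphrase extraction and link discovery.",
    ["Keyphrase extraction", "Link discovery", "Wiki"],
)
print(links)
```

A sketch like this needs no training data, which is the property that distinguishes title-based and text-based approaches from link-based ones in the analysis above.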

TUbiblio

Crossref

Sense and Similarity: A Study of Sense-level Similarity Measures

Author: Erbs Nicolai
Gurevych Iryna
Zesch Torsten
Publication venue: Association for Computational Linguistics and Dublin City University
Publication date: 01/01/2014
Field of study

Sense and Similarity: A Study of Sense-level Similarity Measures

TUbiblio

Crossref

TUdatalib Repository (TU Darmstadt)

Bringing Order to Digital Libraries: From Keyphrase Extraction to Index Term Assignment

Author: Erbs Nicolai
Gurevych Iryna
Rittberger Marc
Publication venue
Publication date: 01/09/2013
Field of study

TUbiblio

Hierarchy Identification

Author: Erbs Nicolai
Gurevych Iryna
Zesch Torsten
Publication venue
Publication date: 01/01/2013
Field of study

The page list data sets and experiments presented in the paper "Hierarchy Identification for Automatically Generating Table-of-Contents".

TUdatalib Repository (TU Darmstadt)

First Aid for Information Chaos in Wikis: Collaborative Information Management Enhanced Through Language Technology

Author: Bär Daniel
Erbs Nicolai
Gurevych Iryna
Zesch Torsten
Publication venue: VWH
Publication date: 01/01/2011
Field of study

TUbiblio

Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

Author: Daniel Bär
Iryna Gurevych
Nicolai Erbs
Torsten Zesch
Publication venue
Publication date: 01/06/2011
Field of study

We present Wikulu, a system focusing on supporting wiki users with their everyday tasks by means of an intelligent interface. Wikulu is implemented as an extensible architecture which transparently integrates natural language processing (NLP) techniques with wikis. It is designed to be deployed with any wiki platform, and the current prototype integrates a wide range of NLP algorithms such as keyphrase extraction, link discovery, text segmentation, summarization, or text similarity. Additionally, we show how Wikulu can be applied for visually analyzing the results of NLP algorithms, educational purposes, and enabling semantic wikis.

CiteSeerX

TUbiblio